# Auto
_Ch 02 - Q9 (applied)_

__Description__
Gas mileage, horsepower, and other information for 392 vehicles.

__Source__
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the 1983 American Statistical Association Exposition.

__References__
This dataset is a part of the course material of the [book](https://www.statlearning.com/) : ___Introduction to Statistical Learning with R___ (Ch 02 - Statistical Learning - Applied Exercises - Problem 9)

__Short description of variables__

- <b>mpg :</b> miles per gallon
- <b>cylinders :</b> Number of cylinders between 4 and 8
- <b>displacement :</b> Engine displacement (cu. inches)
- <b>horsepower :</b> Engine horsepower
- <b>weight :</b> Vehicle weight (lbs.)
- <b>acceleration :</b> Time to accelerate from 0 to 60 mph (sec.)
- <b>year :</b> Model year (modulo 100)
- <b>origin :</b> Origin of car (1. American, 2. European, 3. Japanese)
- <b>name :</b> Vehicle name
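
The two encoded variables in the list above (`year` and `origin`) can be decoded as in the minimal sketch below. The toy rows are made up for illustration, the names `full_year` and `origin_label` are hypothetical, and adding 1900 assumes every model year falls in the twentieth century.

```r
# Toy rows using the encodings described above (values are illustrative only)
toy <- data.frame(year = c(70, 76, 82), origin = c(1, 3, 2))

# Model year is stored modulo 100; assuming 20th-century cars, 70 stands for 1970
toy$full_year <- 1900 + toy$year

# Map the numeric origin codes to readable labels
toy$origin_label <- factor(toy$origin, levels = 1:3,
                           labels = c("American", "European", "Japanese"))
toy
```

Keeping the original columns and adding decoded copies leaves the rest of the analysis unaffected.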

<a id='toc'></a>
### Index
- [1) Load packages](#1%29-Load-packages)
- [2) Import Data](#2%29-Import-Data)
- [3) Data preparation](#3%29-Data-preparation)
- [a) Which of the predictors are quantitative, and which are qualitative?](#(a%29-Which-of-the-predictors-are-quantitative,-and-which-are-qualitative?)
- [b) What is the range of each quantitative predictor?](#(b%29-What-is-the-range-of-each-quantitative-predictor?)
- [c) What is the mean and standard deviation of each quantitative predictor?](#(c%29-What-is-the-mean-and-standard-deviation-of-each-quantitative-predictor?)
- [d) Range, mean and standard deviation after removing observations 10-85](#(d%29-Range,-mean-and-standard-deviation-after-removing-observations-10-85)
- [e) Graphical examination of predictors](#(e%29-Graphical-examination-of-predictors)
- [f) Variables useful in predicting mpg](#(f%29-Variables-useful-in-predicting-mpg)

###### ------------------------------------------------------------

### 1) Load packages

    library(ggplot2)
    library(RColorBrewer)
    pacman::p_load(plotly, tidyverse, reshape2)
    library(IRdisplay)

    # save default options and parameters
    defop = options()
    defpar = par(no.readonly=T)

    # function to modify plot parameters
    plot_pars = function(w=7,h=7) {
      options(repr.plot.width=w, repr.plot.height=h)
    }

###### ------------------------------------------------------------

### 2) Import Data

    # Import data
    fdir = r"(E:\Data Science\Statistics\Intro to Statistical Learning with R)"
    fpath = file.path(fdir,'datasets','Auto.csv')
    auto = read.csv(fpath)
    dim(auto)
    head(auto)

    # check for missing values
    anyNA(auto)

###### ------------------------------------------------------------

### 3) Data preparation

    # structure of dataset
    str(auto)

The fact that a column containing numbers (horsepower) has been read as 'chr' is a red flag. This happens when not every element in the column is numeric. The column needs to be examined further.
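
To illustrate why a single stray token turns a numeric column into 'chr', here is a minimal sketch on a toy CSV. The three rows are invented and only stand in for Auto.csv; it assumes R 4.0 or later, where `read.csv` keeps strings as character rather than factor.

```r
# Toy CSV standing in for Auto.csv (values are illustrative only)
csv_text <- "mpg,horsepower
18,130
25,?
31,65"

toy <- read.csv(text = csv_text)

# One non-numeric entry ("?") forces the whole column to character
class(toy$horsepower)

# Flag the entries that cannot be parsed as numbers
bad <- is.na(suppressWarnings(as.numeric(toy$horsepower)))
toy[bad, ]
```

`suppressWarnings()` only hides the expected "NAs introduced by coercion" message; the `NA`s themselves are what mark the problem rows.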

    # Rows with non-numeric values (as.numeric() returns NA for them, with a warning)
    auto[which(is.na(as.numeric(auto$horsepower))),]
    # is.numeric(auto$horsepower) will not give a boolean array but a single TRUE/FALSE value.

Since the number of missing values is very small, those rows can just be deleted.
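
As an alternative to deleting the rows by hand, the placeholder can be declared as missing at import time. The sketch below is not the approach used in this notebook; it uses an invented toy CSV and the `na.strings` argument of `read.csv`, then drops incomplete rows with `complete.cases`.

```r
# Toy CSV standing in for Auto.csv (values are illustrative only)
csv_text <- "mpg,horsepower
18,130
25,?
31,65"

# Treat "?" as NA while reading, so horsepower is parsed as a numeric column
toy <- read.csv(text = csv_text, na.strings = "?")
str(toy)

# Keep only the rows without missing values
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

With this route no later `as.numeric()` conversion is needed, at the cost of treating every "?" in every column as missing.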

    # Rows with ? in horsepower
    which(auto$horsepower == '?')

    # Rows with ? in any column
    auto[which(apply(auto, 1, function(x) any(x %in% c("?")))), ]

    # Deleting rows with ?
    # (positive indexing: -which(...) would drop every row if nothing matched)
    auto = auto[which(auto$horsepower != '?'), ]
    sum(auto=="?")
    dim(auto)
    class(auto$horsepower)

    # Convert horsepower to numeric
    auto$horsepower = as.numeric(auto$horsepower)
    str(auto)

###### ------------------------------------------------------------

### (a) Which of the predictors are quantitative, and which are qualitative?

*Quantitative* → numerical values.<br>
*Qualitative* → values in one of K different classes, or categories.
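
A column stored as numbers is not necessarily quantitative, as the sketch below shows for `origin`. It is a rough heuristic on invented toy data: the threshold of 3 distinct values is an arbitrary assumption, and the final call still needs domain judgement.

```r
# Toy data: origin is numeric in storage but categorical in meaning
toy <- data.frame(
  mpg    = c(18, 25, 31, 22, 27),
  origin = c(1, 2, 3, 1, 1),
  name   = c("a", "b", "c", "d", "e")
)

is_num   <- sapply(toy, is.numeric)
n_unique <- sapply(toy, function(x) length(unique(x)))

# Assumed rule: numeric columns with more than 3 distinct values are quantitative
col_type <- setNames(
  ifelse(is_num & n_unique > 3, "quantitative", "qualitative"),
  names(toy)
)
col_type
```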

    # No. of unique values in columns
    sapply(auto, function(x) length(unique(x)))

| variable | description | variable type |
|---|---|---|
| mpg | miles per gallon | quantitative |
| cylinders | Number of cylinders between 4 and 8 | qualitative or categorical |
| displacement | Engine displacement (cu. inches) | quantitative |
| horsepower | Engine horsepower | quantitative |
| weight | Vehicle weight (lbs.) | quantitative |
| acceleration | Time to accelerate from 0 to 60 mph (sec.) | quantitative |
| year | Model year (modulo 100) | quantitative |
| origin | Origin of car (1. American, 2. European, 3. Japanese) | qualitative or categorical |
| name | Vehicle name | qualitative or categorical |

"year" can be treated as quantitative in the sense that it may indirectly reflect the technological capabilities of the time; otherwise it can be treated as qualitative (categorical).

###### ------------------------------------------------------------

### (b) What is the range of each quantitative predictor?

    # Quantitative predictors
    quant_data = auto[, setdiff(names(auto), c('cylinders','origin','name'))]
    names(quant_data)

    summ = data.frame(sapply(quant_data, range), row.names=c('min','max'))
    summ['range',] = summ['max',] - summ['min',]
    summ

###### ------------------------------------------------------------

### (c) What is the mean and standard deviation of each quantitative predictor?

    summ['mean',] = sapply(quant_data, mean)
    summ['sd',] = sapply(quant_data, sd)
    round(summ, 2)

###### ------------------------------------------------------------

### (d) Range, mean and standard deviation after removing observations 10-85

    # Removing rows 10-85
    df1 = quant_data[-seq(10,85), ]
    dim(df1)

    # Check if the 10th row in filtered dataset matches 86th row of unfiltered dataset
    all(df1[10,] == quant_data[86, ])

    summ = data.frame(sapply(df1, range), row.names=c('min','max'))
    summ['range',] = summ['max',] - summ['min',]
    summ['mean',] = sapply(df1, mean)
    summ['sd',] = sapply(df1, sd)
    round(summ, 3)

    # Check sum
    sapply(df1, sum)

###### ------------------------------------------------------------

### (e) Graphical examination of predictors

    # Pairs plot panels
    # Diagonal panel: histogram scaled to unit height
    panel.hist <- function(x, col="thistle", ...)
    {
        usr <- par("usr"); on.exit(par(usr = usr))   # restore the user coordinates on exit
        par(usr = c(usr[1:2], 0, 1.5) )
        h <- hist(x, plot = FALSE)
        breaks <- h$breaks; nB <- length(breaks)
        y <- h$counts; y <- y/max(y)
        rect(breaks[-nB], 0, breaks[-1], y, col = 'whitesmoke', ...)
    }

    # Lower panel: absolute correlation, with text size proportional to its value
    panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
    {
        usr <- par("usr"); on.exit(par(usr = usr))
        par(usr = c(0, 1, 0, 1))
        r <- abs(cor(x, y))
        txt <- format(c(r, 0.123456789), digits = digits)[1]
        txt <- paste0(prefix, txt)
        if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
        text(0.5, 0.5, txt, cex = cex.cor * r)
    }

    # Hue - cylinder
    col_map = setNames(c('#F1DCD7','#E3BAC0','#CD9BB1','#AE80A3','#61566E'), sort(unique(auto$cylinders)))
    # look colours up by name rather than by factor code
    cols = col_map[as.character(auto$cylinders)]
    plot_pars(17,15)
    pairs(quant_data, cex=2.5, cex.labels = 2, cex.axis=1.6, pch=21, bg=cols, col='whitesmoke', lwd=0.3,
          diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,15,3))
    par(xpd = TRUE)
    legend(x=0.25, y=1, fill=unlist(col_map), legend=names(col_map), horiz=T, bty="n")
    options(defop)

<div class="alert alert-block alert-info"><a id=''></a><b>Observations:</b><br>
- acceleration appears to be roughly normally distributed.<br>
- Increase in no. of cylinders leads to <br>
  • lower : mpg, acceleration<br>
  • higher : displacement, horsepower, weight<br>
- There has been a decline in the no. of new models coming out with 8 cylinders.<br>
- Newer models are lighter and have less horsepower (presumably because of decreased weight).<br>
- mpg appears to have strong (non-linear) relationships with displacement, horsepower, weight and is negatively correlated with the 3 variables.<br>
- mpg of new models has improved over the years.<br>
- displacement, horsepower and weight appear to have a strong positive correlation with each other.<br>
- A moderate negative correlation may exist between horsepower and acceleration.<br>
</div>

    # Converting column cylinders to factor before using for 'color'
    # (assigning to auto$cylinder would silently create a second column)
    auto$cylinders = as.factor(auto$cylinders)

    # Scatter plot - Cylinders as hue
    pal = c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')
    fig = plot_ly(data=auto, x=~year, y=~mpg, color=~as.factor(cylinders), colors=pal,
                  type="scatter", mode='markers', marker = list(size=8),
                  text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin, '<br>cyl:',cylinders),
                  width=700, height=400) %>%
      layout(showlegend=T, legend = list(orientation="v", xanchor="center", x=1.05, y=0.9),
             title = "Scatter Plot", hoverlabel=list(bgcolor=pal),
             xaxis = list(title = "Year", zeroline=F, showgrid=F),
             yaxis = list(title = "mpg", zeroline=F, gridcolor="white"))
    options(jupyter.plot_mimetypes=c("text/html","image/svg+xml"))
    display(fig)

    # Scatter plot - origin as hue
    fig = plot_ly(auto, x=~year, y=~mpg, color=~as.factor(origin), width=700, height=400,
                  colors = brewer.pal(length(unique(auto$origin)),"Set1"),
                  symbol=~origin, symbols = c('triangle-up','circle','x'),
                  text = ~paste('mpg:',mpg, '<br>name:',name, '<br>origin:',origin, '<br>cyl:',cylinders),
                  type="scatter", mode='markers', marker=list(size=8)) %>%
      hide_colorbar() %>%
      layout(showlegend = TRUE, legend = list(orientation="h", xanchor="center", x=0.45, y=1),
             hoverlabel=list(bgcolor="white"), title = "Scatter Plot",
             xaxis = list(title = "Year", zeroline=F, showgrid=F),
             yaxis = list(title = "mpg", zeroline=F, gridcolor="white")) %>%
      config(displayModeBar = FALSE)
    options(jupyter.plot_mimetypes=c("text/html","image/svg+xml"))
    display(fig)

    # https://rstudio-pubs-static.s3.amazonaws.com/448200_6bb02977b4c04e0da508ac0131f71d48.html
    # https://colorbrewer2.org/#type=qualitative&scheme=Set1&n=3
    # https://plotly.com/python/marker-style/#custom-marker-symbols
    # symbol=~origin, symbols = c('square','circle','x','+','x','triangle')
    # use symbol('^') to get list and names of acceptable symbols
    # config(displayModeBar = FALSE) >> hides plotly bar (https://plotly-r.com/control-modebar.html)
    # css colors >> https://www.w3.org/TR/css-color-3/#svg-color
    # hover format >> https://plotly.com/r/hover-text-and-formatting/
    # color=~as.factor(origin) shows the legend, then hide_colorbar() not reqd

    # Color map
    cols = character(nrow(auto))
    cols[] = '#61566E'
    cols[auto$origin == 1] = '#EDD1CB'
    cols[auto$origin == 2] = '#AD6E94'

    # Hue - origin
    plot_pars(15,15)
    pairs(quant_data, cex=2.2, cex.labels = 2, cex.axis=1.6, bg=cols, pch=21, col='whitesmoke',
          diag.panel=panel.hist, lower.panel=panel.cor, oma=c(3,3,16,3))
    par(xpd = TRUE)
    legend(x=0.4, y=1, fill=c('#EDD1CB','#AD6E94','#61566E'),
           legend=c(levels(as.factor(auto$origin))), horiz=T, bty="n")
    options(defop)

<div class="alert alert-block alert-info"><a id=''></a><b>Observations:</b><br>
- Cars of European (2) and Japanese (3) origin can be seen to be overlapping in many criteria whereas American cars (1) have a larger and distinct spread.<br>
- Clear distinctions can be seen between American and the other 2 carmakers in displacement, horsepower and weight.
</div>

    # Frequency distribution of cylinders
    table(auto$cylinders)

    # year-wise distribution of cars with cylinder count
    cyl_year = auto %>% count(year,cylinders) %>% spread(cylinders, n, fill=0)
    cyl_year

    # year-wise distribution of cars with cylinder count
    df <- melt(cyl_year , id.vars='year', variable.name='cylinders')
    p = ggplot(data=df, aes(x=year, y=value)) +
      geom_line(linetype='twodash', aes(color=cylinders)) +
      geom_point(aes(color=cylinders)) +
      theme_classic() +
      scale_color_brewer(palette="Paired") +
      theme(legend.title=element_text(size=10), legend.text=element_text(size=8))
    fig = ggplotly(p, width=650, height=350)
    # y is the number of models per year, so the axis is labelled "count" rather than "mpg"
    fig = fig %>% config(displayModeBar=T) %>%
      layout(plot_bgcolor='white',
             xaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=1,
                        tickfont=list(size=10), zeroline=F, showgrid=F, color="black",
                        title=list(text="year", standoff=10, font=list(size=15))),
             yaxis=list(showticklabels=T, tickmode="linear", tickformat='0', dtick=5,
                        title=list(text="count", standoff=15, font=list(size=15)),
                        gridcolor="white", tickfont=list(size=10)))
    options(jupyter.plot_mimetypes=c("text/html","image/svg+xml"))
    display(fig)

    # setup 1 - classic ggplot theme
    df <- melt(cyl_year , id.vars='year', variable.name='cylinders')
    plot_pars(7,3)
    fig = ggplot(data=df, aes(x=year, y=value)) +
      geom_line(linetype='twodash', aes(color=cylinders), size=0.8) +
      geom_point(aes(color=cylinders), size=2) +
      theme_classic() +
      scale_color_manual(values=c('#fdc086','#386cb0','#beaed4','#33a02c','#f0027f')) +
      scale_x_continuous(breaks = seq(min(df$year), max(df$year), by=1)) +
      scale_y_continuous(breaks = round(seq(min(df$value), max(df$value), by=5), 1)) +
      theme(text=element_text(size=13),
            panel.background=element_rect(fill="white"),
            panel.grid.major.x=element_blank(),
            panel.grid.major.y=element_blank(),
            panel.grid.minor.x=element_blank(),
            panel.grid.minor.y=element_blank(),
            axis.text.x=element_text(color="grey20", size=9, angle=0, hjust=.5, vjust=.5, face="plain"),
            axis.text.y=element_text(color="grey20", size=9, angle=0, hjust=1, vjust=0, face="plain"),
            axis.title.x=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=0, face="plain"),
            axis.title.y=element_text(color="grey20", size=12, angle=0, hjust=.5, vjust=.5, face="plain"))
    fig

###### ------------------------------------------------------------

### (f) Variables useful in predicting mpg

Except for acceleration, all the variables display some sort of relationship or trend with mpg, whether positive or negative.<br>
_Positive_ : year<br>
_Negative_ : cylinders, displacement, horsepower, weight<br>
_Non-directional_ : origin<br>
They can be taken into account for predicting mpg, after adjusting for collinearity.
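
One possible way to gauge the collinearity mentioned above is a correlation matrix together with variance inflation factors. The sketch below uses R's built-in `mtcars` data purely as a stand-in for Auto (its columns `cyl`, `disp`, `hp`, `wt` play similar roles), and computes VIF by hand as 1 / (1 - R²) rather than through any package.

```r
# mtcars ships with base R and is used here only as a stand-in for Auto
vars <- c("cyl", "disp", "hp", "wt")

# Pairwise correlations between the response and the candidate predictors
round(cor(mtcars[, c("mpg", vars)]), 2)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
vif <- sapply(vars, function(v) {
  others <- setdiff(vars, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

Large VIF values suggest that a predictor is largely explained by the others, so keeping all of them in one linear model may give unstable coefficients.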